Speech generator


The NPU-HWC System for the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge

Guo, Dake, Yao, Jixun, Zhu, Xinfa, Xia, Kangxiang, Guo, Zhao, Zhang, Ziyu, Wang, Yao, Liu, Jie, Xie, Lei

arXiv.org Artificial Intelligence

This paper presents the NPU-HWC system submitted to the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge (ICAGC). Our system consists of two modules: a speech generator for Track 1 and a background audio generator for Track 2. In Track 1, we employ Single-Codec to tokenize speech into discrete tokens and use a language-model-based approach to achieve zero-shot speaking-style cloning. Single-Codec effectively decouples timbre and speaking style at the token level, reducing the acoustic modeling burden on the autoregressive language model. Additionally, we use DSPGAN to upsample 16 kHz mel-spectrograms into high-fidelity 48 kHz waveforms. In Track 2, we propose a background audio generator based on large language models (LLMs). This system produces scene-appropriate accompaniment descriptions, synthesizes background audio with Tango 2, and integrates it with the speech generated by our Track 1 system. Our submission achieves second place in Track 1 and first place in Track 2.
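
The abstract describes a three-step Track 1 pipeline: codec tokenization, autoregressive token prediction, and vocoder upsampling. The following is a minimal sketch of how such a pipeline fits together, assuming hypothetical wrapper objects (codec, style_lm, vocoder) and method names; it is an illustration of the described architecture, not the authors' released code.

    def synthesize(text, reference_wav, codec, style_lm, vocoder):
        """Zero-shot style cloning: prompt an autoregressive LM with the
        reference speaker's codec tokens, then vocode to 48 kHz audio.
        All three dependencies are hypothetical stand-ins."""
        # 1. Tokenize the reference speech; the codec is assumed to separate
        #    timbre from speaking style at the token level.
        prompt_tokens = codec.encode(reference_wav)       # discrete tokens
        # 2. Autoregressively continue the token sequence, conditioned on text.
        speech_tokens = style_lm.generate(text, prompt=prompt_tokens)
        # 3. Decode tokens to a 16 kHz mel-spectrogram, then upsample to a
        #    48 kHz waveform with a GAN vocoder (DSPGAN in the paper).
        mel_16k = codec.decode(speech_tokens)
        return vocoder.upsample(mel_16k)                  # 48 kHz waveform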


Not My Voice! A Taxonomy of Ethical and Safety Harms of Speech Generators

Hutiri, Wiebke, Papakyriakopoulos, Orestis, Xiang, Alice

arXiv.org Artificial Intelligence

The rapid and wide-scale adoption of AI to generate human speech poses a range of significant ethical and safety risks to society that need to be addressed. For example, a growing number of speech generation incidents are associated with swatting attacks in the United States, where anonymous perpetrators use synthetic voices to call police, shutting down schools and hospitals or sending armed officers into innocent citizens' homes. Incidents like these demonstrate that multimodal generative AI risks and harms do not exist in isolation, but arise from the interactions of multiple stakeholders and technical AI systems. In this paper, we analyse speech generation incidents to study how patterns of specific harms arise. We find that specific harms can be categorised according to the exposure of affected individuals: whether they are a subject of, interact with, suffer due to, or are excluded from speech generation systems. Specific harms are also a consequence of the motives of the creators and deployers of the systems. Based on these insights, we propose a conceptual framework for modelling pathways to ethical and safety harms of AI, which we use to develop a taxonomy of harms of speech generators. Our relational approach captures the complexity of risks and harms in sociotechnical AI systems, and yields an extensible taxonomy that can support appropriate policy interventions and decision making for responsible multimodal model development and release of speech generators.
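
The abstract's relational framing lends itself to a simple data structure. Below is a minimal sketch, assuming the four exposure categories quoted in the abstract; the class names (Exposure, HarmPathway), field names, and category descriptions are illustrative guesses at the taxonomy's shape, not the paper's actual schema.

    from dataclasses import dataclass
    from enum import Enum

    class Exposure(Enum):
        """The four exposure categories named in the abstract."""
        SUBJECT_OF = "voice is cloned or imitated"
        INTERACTS_WITH = "addressed or deceived by generated speech"
        SUFFERS_DUE_TO = "harmed by outputs aimed at others"
        EXCLUDED_FROM = "cannot access or is not served by the system"

    @dataclass
    class HarmPathway:
        """A relational taxonomy entry: harm emerges from the interaction
        of a deployer's motive with an individual's exposure."""
        exposure: Exposure
        deployer_motive: str
        harm: str

    # The swatting incidents from the abstract, encoded as one pathway:
    swatting = HarmPathway(Exposure.SUFFERS_DUE_TO, "intimidation",
                           "armed police response directed at innocents")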


Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction

Kim, Minchan, Jeong, Myeonghun, Choi, Byoung Jin, Kim, Semin, Lee, Joun Yeop, Kim, Nam Soo

arXiv.org Artificial Intelligence

We propose a novel text-to-speech (TTS) framework centered around a neural transducer. Our approach divides the TTS pipeline into a semantic-level sequence-to-sequence (seq2seq) modeling stage and a fine-grained acoustic modeling stage, utilizing discrete semantic tokens obtained from wav2vec2.0 embeddings. For robust and efficient alignment modeling, we employ a neural transducer, named the token transducer, for semantic token prediction, benefiting from its hard monotonic alignment constraints. Subsequently, a non-autoregressive (NAR) speech generator efficiently synthesizes waveforms from these semantic tokens. Additionally, reference speech controls the temporal dynamics and acoustic conditions at each stage. This decoupled framework reduces the training complexity of TTS while allowing each stage to focus on semantic or acoustic modeling. Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in speech quality and speaker similarity, both objectively and subjectively. We also examine the inference speed and prosody control capabilities of our approach, highlighting the potential of neural transducers in TTS frameworks.
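
The two-stage decomposition is the core of the abstract's argument, so here is a minimal sketch of the inference flow under the stated design; the object and method names (tokenizer, transducer, generator and their calls) are hypothetical placeholders, not the authors' API.

    def two_stage_tts(text, reference_wav, tokenizer, transducer, generator):
        """Two-stage synthesis following the abstract's decomposition."""
        # Reference speech conditions both stages (prosody, then acoustics).
        style = tokenizer.embed_reference(reference_wav)
        # Stage 1: the token transducer maps text to discrete semantic tokens
        # (derived from wav2vec2.0 embeddings at training time); its hard
        # monotonic alignment avoids skipped or repeated words.
        semantic_tokens = transducer.decode(text, condition=style)
        # Stage 2: a non-autoregressive generator renders the waveform in
        # parallel from the semantic tokens.
        return generator.synthesize(semantic_tokens, condition=style)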


Google's DeepMind Pays Off With AI That Mimics Human Speech

#artificialintelligence

There are many speech generation programs that create artificial human speech, such as Minitalk and eSpeak, but Google believes it has a system that cuts the quality gap between machine and human speech by 50%. DeepMind, the Google unit working on highly capable AI, has created an artificial intelligence called WaveNet that closely mimics human speech by modeling a voice's actual sound waves, one audio sample at a time. Older speech generators either splice together short clips of a previously recorded speaker or electronically generate speech from rules about how certain letter combinations are supposed to be pronounced. Both approaches produce intelligible speech, but the results sound robotic and lack the fluidity of human diction.
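
WaveNet's published design models audio autoregressively with stacks of dilated causal convolutions, so each output sample can depend on thousands of past samples. The toy sketch below, in plain NumPy and emphatically not DeepMind's code, illustrates only that mechanism: causal filtering with a dilation rate that doubles per layer, and the resulting receptive field.

    import numpy as np

    def causal_dilated_conv(x, w, dilation):
        """1-D causal convolution: output[t] depends only on x[<= t]."""
        pad = dilation * (len(w) - 1)                 # left-pad, stay causal
        x_padded = np.concatenate([np.zeros(pad), x])
        return np.array([
            sum(w[k] * x_padded[t + pad - k * dilation] for k in range(len(w)))
            for t in range(len(x))
        ])

    signal = np.random.randn(16000)           # one second of 16 kHz audio
    layer = signal
    for d in (1, 2, 4, 8, 16, 32, 64, 128):   # dilation doubles per layer
        layer = causal_dilated_conv(layer, np.array([0.5, 0.5]), d)
    # Receptive field of the stack: 1 + (1+2+...+128) = 256 past samples,
    # which is how dilation buys long context at modest depth.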